How to Search the Internet Archive Without Indexing It
نویسندگان
چکیده
Significant parts of cultural heritage are produced on the web during the last decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. This includes dealing with large-scale web archive collections and lacking of usage logs that contain implicit human feedback most relevant for today’s web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. In addition, we link retrieved results to the WayBack Machine; thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface, which comes close to the functionalities of modern web search engines (e.g., keyword search, query auto-completion and related query suggestion), and provides a great benefit of taking user feedback on the current web into account also for web archive search. Through extensive experiments, we conduct quantitative and qualitative analyses in order to provide insights that enable further research on and practical applications of web archives.
منابع مشابه
Full Text Search of Web Archive Collections
The Internet Archive, in cooperation with the International Internet Preservation Consortium, is developing an open source full text search of Web archive collections. Web archive collection search presents the usual set of technical difficulties searching large collections of documents. It also introduces new challenges often at odds with typical search engine usage. This paper outlines the ch...
متن کاملAuto-Explore the Web – Web Crawler
World Wide Web is an ever-growing public library with hundreds of millions of books without any central management system. Finding a piece of information without a proper directory is like finding a middle in a haystack. Various search engines solve this problem by indexing an amount of the complete content that is available in the internet. For accomplishing this job, search engines use an aut...
متن کاملContent Based Search in Web Archives
The widespread use of Internet as potentially useful and important data repository has led to the proliferation of Internet usage in search of valuable information to be used for important decision making. As the amount of the data stored and made available in the Internet grows, it becomes extremely necessary to create and maintain an Internet archive as a data repository for the purpose of su...
متن کاملیک روش مبتنی بر خوشهبندی سلسلهمراتبی تقسیمکننده جهت شاخصگذاری اطلاعات تصویری
It is conventional to use multi-dimensional indexing structures to accelerate search operations in content-based image retrieval systems. Many efforts have been done in order to develop multi-dimensional indexing structures so far. In most practical applications of image retrieval, high-dimensional feature vectors are required, but current multi-dimensional indexing structures lose their effici...
متن کاملA Comparing between the impacts of text based indexing and folksonomy on ranking of images search via Google search engine
Background and Aim: The purpose of this study was to compare the impact of text based indexing and folksonomy in image retrieval via Google search engine. Methods: This study used experimental method. The sample is 30 images extracted from the book “Gray anatomy”. The research was carried out in 4 stages; in the first stage, images were uploaded to an “Instagram” account so the images are tagge...
متن کامل